quality data
Not Every AI Problem Is a Data Problem
Why we should be intentional about data scaling. Large language models (LLMs) have revolutionized the AI landscape, demonstrating remarkable capabilities across a wide range of tasks. Each new model seemingly reinforces the notion that modern transformer-based AI can conquer any challenge if armed with sufficient compute and data. However, while scaling has accelerated certain applications, such as robotics, it has yet to show significant impact in others, such as identifying misinformation.
Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling
Rodchenko, Tanya, Noy, Natasha, Scherrer, Nino, Prendki, Jennifer
For example, translation between languages exhibits regular and persistent patterns at different scales (across sentences, paragraphs, documents). In general, language patterns are stable over time. We know what type of data we need to expand to new languages. And while it may be challenging to acquire the data for rare or spoken-only languages, it is easy to judge whether newly acquired data is what we need. In contrast, use cases where data lacks strong, persistent topological features, or where the structure is highly fragmented or unstable over time, may not be as well suited to data scaling approaches.
Novel Regression and Least Square Support Vector Machine Learning Technique for Air Pollution Forecasting
Air pollution is the introduction of particulate matter, chemicals, or biological substances that harm humans or other living creatures, or that damage the natural habitat and the atmosphere. Hence, air pollution remains one of the paramount environmental issues as far as metropolitan cities are concerned. Several air pollution benchmarks are even said to have a negative influence on human health. Also, improper detection of air pollution benchmarks results in severe complications for humans and living creatures. To address this, a novel technique called Discretized Regression and Least Square Support Vector (DR-LSSV) based air pollution forecasting is proposed. The results indicate that the proposed DR-LSSV technique can efficiently enhance air pollution forecasting performance and outperforms conventional machine learning methods in terms of air pollution forecasting accuracy, air pollution forecasting time, and false positive rate.
The Significance of Data Quality in Making a Successful Machine Learning Model - KDnuggets
AI has been a buzzword for quite some time now and is highly ubiquitous. The number of AI-enabled applications on the market has increased extensively. We have also been 'blessed' with powerful infrastructure and advanced algorithms. However, that does not make the journey of taking your ML project to production any easier. The issue of data quality is not new; it has gained attention since the onset of machine learning (ML) applications.
7 Tips for Value-Driven AI
How your business can improve the skills of its talent to take greater advantage of AI. There's no doubt that artificial intelligence (AI) is changing the way business is done today. AI will ultimately transform every business in every industry. However, despite their desire to use data science when making decisions, many organizations can't find enough qualified data scientists to develop and run their data science initiatives. Nonetheless, with online training and readily available tools, any software engineer -- or even a business user with a math background -- can become a data scientist.
Discretized Linear Regression and Multiclass Support Vector Based Air Pollution Forecasting Technique
Air pollution is a vital issue emerging from the uncontrolled utilization of traditional energy sources as far as developing countries are concerned. Hence, ingenious air pollution forecasting methods are indispensable to minimize the risk. To that end, this paper proposes an Internet of Things (IoT) enabled system for monitoring and controlling air pollution in the cloud computing environment. A method called Linear Regression and Multiclass Support Vector (LR-MSV) IoT-based Air Pollution Forecast is proposed to monitor the air quality data and the air quality index measurement, paving the way for effective control. Extensive experiments carried out on the air quality data in the India dataset have revealed the outstanding performance of the proposed LR-MSV method when benchmarked against well-established state-of-the-art methods. The results obtained by the LR-MSV method show a significant increase in air pollution forecasting accuracy, together with reduced air pollution forecasting time and error rate, compared with the results produced by the other state-of-the-art methods.
No, You're Not Alone. Google Is Also Making This Big Mistake On AI
Just this past month, an article was shared that showed that over 30% of the data used by Google for one of their shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data used by that model was also full of mistakes. How could anyone using Google's model ever hope to trust the results if it's full of human-induced errors that computers can't fix? And Google isn't alone with major data mislabeling: an MIT study in 2021 found that almost 6% of the images in the industry-standard ImageNet database are mislabeled, and furthermore found "label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets". How can we hope to trust or use these models if the data used to train those models is so bad?
Data Centric Artificial Intelligence
Data-centric artificial intelligence is the modern approach to building AI systems using quality data. Data-centric AI prioritizes the quality of data over the quantity of data, while traditional model-centric AI does the opposite. The key is better data, not big data! The key idea of data-centric AI is to handle data the same way as handling high-quality materials when building a house, i.e., to spend relatively more time labelling, augmenting, managing and curating the data. The traditional way is to optimize highly parameterized models using big data to achieve high performance.
AI and Open Data
We are excited to announce a new project, AI and Open Data: Open Data Needs for AI and International Development. Governments, researchers, and civil society tackling development problems in the global south continue to face challenges of data access and availability. Cutting-edge analytical techniques, like artificial intelligence (AI) and machine learning (ML), promise to increase the effectiveness of development initiatives, but still require quality data as inputs. Open data is still as important for sustainable development as ever. As a field, AI receives significant optimism for its potential impact on sustainable development, including its potential to improve agricultural practices and productivity through aerial and remote sensing, monitor disease outbreaks, and plan and manage energy grids.
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit
Huynh, Jessica, Chiang, Ting-Rui, Bigham, Jeffrey, Eskenazi, Maxine
Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.